Definition

Arthur Samuel

(1959)
Field of study that gives computers the ability to learn without being explicitly programmed

Tom Mitchell

(1998)
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T as measured by P, improves with E

Types of Learning Algorithms

Source for further reading

Supervised learning

Supervised learning algorithms make predictions based on a set of examples. For example, historical sales can be used to estimate future prices. With supervised learning, you have an input variable that consists of labeled training data and a desired output variable. You use an algorithm to analyze the training data and learn the function that maps the input to the output. This inferred function generalizes from the training data to map new, unseen examples and anticipate results in unseen situations.

Classification
When the data are used to predict a categorical variable, supervised learning is also called classification. This is the case when assigning a label or indicator, such as dog or cat, to an image. When there are only two labels, this is called binary classification. When there are more than two categories, the problem is called multi-class classification.

Regression
When predicting continuous values, the problem becomes a regression problem.

Forecasting
This is the process of making predictions about the future based on past and present data. It is most commonly used to analyze trends. A common example is estimating next year's sales based on the sales of the current and previous years.

Semi-supervised learning

The challenge with supervised learning is that labeling data can be expensive and time consuming. If labels are limited, you can use unlabeled examples to enhance supervised learning. Because the machine is not fully supervised in this case, we say the machine is semi-supervised. With semi-supervised learning, you use unlabeled examples with a small amount of labeled data to improve the learning accuracy.

Unsupervised learning

When performing unsupervised learning, the machine is presented with totally unlabeled data. It is asked to discover the intrinsic patterns that underlie the data, such as a clustering structure, a low-dimensional manifold, or a sparse tree and graph.

Clustering
Grouping a set of data examples so that examples in one group (or one cluster) are more similar to each other (according to some criterion) than to those in other groups. This is often used to segment the whole dataset into several groups. Analysis can then be performed within each group to help users find intrinsic patterns.

Dimension reduction
Reducing the number of variables under consideration. In many applications, the raw data have very high dimensional features and some features are redundant or irrelevant to the task. Reducing the dimensionality helps to find the true, latent relationship.

Reinforcement learning

Reinforcement learning analyzes and optimizes the behavior of an agent based on the feedback from the environment. Machines try different scenarios to discover which actions yield the greatest reward, rather than being told which actions to take. Trial-and-error and delayed reward distinguish reinforcement learning from other techniques.

Learning Theory
Study of how and why (mathematically) a learning algorithm works

Concepts

Feature

Anything in the world that we believe to have any impact on the value we are trying to predict (and can be measured).

Sample

A sample is a collection of values given to the features for a single result (predicted or known).
If we are trying to predict how many hairs you have on your head based on Age, Weight, Gender, Neighborhood, Soccer Team, then each person for whom we measure the Age, Weight, ... is a sample.
Each sample is a vector and we write it as $X=[x_1, x_2, ..., x_n]$
To list the samples we use an upper index and write them as a column vector:

\begin{align} \begin{bmatrix} X^0 \\ X^1 \\ \vdots \\ X^m \end{bmatrix} &= \begin{bmatrix} [x_1^0,x_2^0,...,x_n^0] \\ [x_1^1,x_2^1,...,x_n^1] \\ \vdots \\ [x_1^m,x_2^m,...,x_n^m] \\ \end{bmatrix} \end{align}

Scaling and Normalization

In our example, gender: $g\in[0,1]$ and weight: $wgt\in[20,180]$.
Scaling is the action of making the magnitude of all features similar.
A good normalization method would be:

\begin{equation*} x_n = \frac{x-x_m}{x_M-x_m} \end{equation*}

where

  • $x_n$: the normalized x
  • $x_m$: the minimum value x can take
  • $x_M$: the maximum value x can take
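As a quick sketch, the min-max formula above can be applied per feature with NumPy (the sample values below are made up for illustration):

```python
import numpy as np

# Each row is a sample: [gender, weight], as in the example above
X = np.array([[0.0,  20.0],
              [1.0, 180.0],
              [1.0, 100.0]])

x_m = X.min(axis=0)               # per-feature minimum
x_M = X.max(axis=0)               # per-feature maximum
X_norm = (X - x_m) / (x_M - x_m)  # every feature now lies in [0, 1]

print(X_norm)
```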

Target

In the case of supervised learning, the target or label is the actual result for a sample.
For Age=45, Weight=90, Gender=Female, Neighborhood=Villa Tachito, Soccer Team=Boca Jrs the target would be 100 hairs on the head.
We denote the targets for all our samples with the vector $Y$:

\begin{align} \begin{bmatrix} y^0 \\ y^1 \\ \vdots \\ y^m \end{bmatrix} &= Y \end{align}

Hypothesis

$H_\theta(X) = b + \theta_1 x_1 + ... + \theta_n x_n = y$
Where $X=[x_i]$ is a single sample: the features we use to predict a result (age, weight, gender, neighborhood, etc.).
$\theta_i$ are the weights we give to each feature. These are the values we change during learning iterations to improve the accuracy of our estimation. $b$ is the bias, a value we use to fine-tune the hypothesis. We usually set $x_0=1$ so that $b=\theta_0 x_0$ and the bias becomes just another weight.
$y$ is the value we predict using the sample $X$ and the weights $\theta_i$.
So we get:

\begin{align} \begin{bmatrix} X^0 \\ X^1 \\ \vdots \\ X^m \end{bmatrix} \cdot \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} &= \begin{bmatrix} [x_0^0,x_1^0,...,x_n^0] \\ [x_0^1,x_1^1,...,x_n^1] \\ \vdots \\ [x_0^m,x_1^m,...,x_n^m] \\ \end{bmatrix} \cdot \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} = \begin{bmatrix} h_{\theta}(X^0) \\ h_{\theta}(X^1) \\ \vdots \\ h_{\theta}(X^m) \\ \end{bmatrix} \end{align}
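In code, the product above is a single matrix multiplication. A minimal sketch with NumPy (the sample values and weights are invented):

```python
import numpy as np

# m = 3 samples, n = 2 features, with x_0 = 1 prepended for the bias
X = np.array([[1.0, 45.0, 90.0],
              [1.0, 30.0, 70.0],
              [1.0, 60.0, 80.0]])

theta = np.array([10.0, 2.0, 0.5])  # [theta_0 (bias), theta_1, theta_2]

h = X @ theta  # one hypothesis value per sample
print(h)       # [145. 105. 170.]
```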

Cost Function

So, the hypothesis predicts a value on one hand; on the other we have the real value for a sample (in the training set).
There is a difference between the hypothesis and the target value.
The cost function defines a metric to decide what is the distance between the target and the hypothesis.
The most trivial cost function would be the mean squared error (MSE), but it is not always the best choice.
The process of learning, in many algorithms, consists of reducing the cost function. It is done by calculating the hypothesis for a sample, comparing it with the target, updating the weights and trying again, until the cost function reaches a minimum.
Note: the cost function is a function of the weights $\theta_i$ and not of the samples.
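As an example, here is a sketch of a mean-squared-error cost written explicitly as a function of the weights (the toy data below is invented):

```python
import numpy as np

def cost(theta, X, Y):
    """Half the mean squared distance between hypothesis and targets."""
    residuals = X @ theta - Y
    return (residuals ** 2).mean() / 2.0

# Toy training set: x_0 = 1 (bias), one real feature; targets y = 1 + x_1
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
Y = np.array([2.0, 3.0, 4.0])

print(cost(np.array([1.0, 1.0]), X, Y))  # perfect weights -> cost 0.0
print(cost(np.array([0.0, 0.0]), X, Y))  # bad weights -> larger cost
```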

Note:
We usually denote by $n$ the number of features and by $m$ the number of samples.

Hyperparameters

These are the parameters we use to configure the algorithm itself. Even though the hyperparameters are not learned from the features or their values, the best choice for them does depend on the data.

Grid Search is the process of looking for the set of hyperparameters that turns out to be best for our use of the algorithm, and there are many ways to perform it. Each candidate set is usually evaluated with cross-validation; one popular method is k-fold, in which we take the training data and repeatedly split it into a training set and a testing set.
One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, in most methods multiple rounds of cross-validation are performed using different partitions, and the validation results are combined (e.g. averaged) over the rounds to estimate a final predictive model. Wikipedia
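A small sketch of k-fold cross-validation with scikit-learn; the dataset and model here are just placeholders chosen to make the example runnable:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5 rounds: train on 4 folds, validate on the remaining fold
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

print(scores)         # one validation accuracy per round
print(scores.mean())  # combined (averaged) estimate
```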


Confusion Matrix

A confusion matrix is a table that summarizes classification results: each row corresponds to the actual class and each column to the predicted class, so correct predictions accumulate on the diagonal.
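For instance, with scikit-learn (the dog/cat labels below are invented for illustration):

```python
from sklearn.metrics import confusion_matrix

y_true = ['cat', 'cat', 'cat', 'dog', 'dog', 'dog']  # actual labels
y_pred = ['cat', 'cat', 'dog', 'cat', 'dog', 'dog']  # model predictions

cm = confusion_matrix(y_true, y_pred, labels=['cat', 'dog'])
print(cm)
# [[2 1]     row 0: actual cats -> 2 predicted cat, 1 predicted dog
#  [1 2]]    row 1: actual dogs -> 1 predicted cat, 2 predicted dog
```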

Under/Over Fitting

Underfitting occurs when the trained model is too general (too simple, i.e. high bias): it captures only part of the structure in the data and mispredicts many samples, even ones from the training set.
Overfitting happens when our trained model is too specific to the training data (high variance): it will only predict correctly data that is very similar to the training set and will miss predictions on a more general data set.
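A small sketch of both effects using polynomial fits of different degrees (the noisy sine data is invented). Note that the overfit model has the lowest *training* error; its problem only shows up on unseen data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)  # noisy sine

def train_error(degree):
    """Mean squared error of a fitted polynomial on the training set."""
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

err_under = train_error(1)  # a straight line: too general, misses the curve
err_good = train_error(3)   # roughly the right capacity
err_over = train_error(9)   # high degree: starts chasing the noise

print(err_under, err_good, err_over)
```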


Dimension Reduction

There are problems which involve hundreds of features. Dimension reduction is the action of defining new, fewer features that are enough to describe and predict our target.
Dimension reduction is useful for data visualization, to ease the understanding of the data. It can also improve the prediction performance. There are many methods; here I present two: PCA and tSNE.


PCA

Principal Component Analysis looks for the directions (linear combinations of the original features) along which the data has the highest variance, and uses the top ones as the new, reduced features.
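A minimal PCA sketch with scikit-learn (the iris data is just a convenient stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 150 samples, 4 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)   # same samples, 2 new features

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance per component
```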

tSNE

t-Distributed Stochastic Neighbor Embedding is an iterative method that reduces dimensions while preserving the local structure of the data, i.e. keeping near neighbors near.
Interesting Note
Even though you can embed the data into any number of dimensions, increasing that number won't necessarily improve the embedding as one would expect.

Beautiful hands-on demo.
This is the page where the method's inventor centralizes all info about it.
This demo is a much easier example to understand: given lots of texts (or photos), tSNE decides which words (or photos) are related to each other.

Ensembles

Creating ensembles involves aggregating the predictions of several different models.

Some Popular Algorithms

Linear Regression


  • The outcome (the value we predict) is continuous
    For any given value of the independent parameter X we can predict any value for the predicted value Y
    Given the RPM value of the engine, what power could we expect?
  • Example
    Diabetes dataset
    Ten baseline variables: age, sex, body mass index, average blood pressure and six blood serum measurements.
    The values were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.
    Data Set Characteristics
    Number of Instances: 442
    Number of Attributes:
    First 10 columns are numeric predictive values.
    Target: Column 11 is a quantitative measure of disease progression one year after baseline
    Attributes:
    Age, Sex, Body mass index, Average blood pressure, S1, S2, S3, S4, S5, S6
    Note:
    Each of these 10 feature variables have been mean centered and scaled by the standard deviation times n_samples (i.e. the sum of squares of each column totals 1).
    Source URL
In [ ]:
# print(__doc__)


# Code source: Jaques Grobler
# License: BSD 3 clause


import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# display(diabetes.DESCR)


# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xlabel('Body Mass Index (scaled)')
plt.ylabel('1 Year Disease Progression')

plt.xticks(())
plt.yticks(())

plt.show()

To think together Let's say we are a real estate agent in Alaska.
We want to predict the price of an igloo given its radius.
We believe there is a linear relation between the area of the igloo and its price.
The thing is, the area of the igloo is proportional to $r^2$.
How do we find a linear regression to predict igloo price based on its radius?
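One possible answer, sketched in code: transform the radius into the feature $r^2$ first, then fit a line, because price is linear in area, not in radius. The igloo prices below are made up (price = 10 + 5 · r²):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical igloos: radius in meters, price in k$
radius = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
price = 10 + 5 * radius ** 2

# Fit the linear regression on the transformed feature r^2
X = (radius ** 2).reshape(-1, 1)
model = LinearRegression().fit(X, price)

# Predict the price of a 6 m igloo: feed it r^2 = 36
print(model.predict(np.array([[36.0]])))  # [190.]
```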

Logistic Regression

  • The outcome (the value we predict) is discrete
    For any given value of the independent parameter X we predict one of a few options for the prediction Y
    Given the grades and sleeping habits of a student, will he pass the test or not?
  • Approach
    1. Define a hypothesis built from the parameters
    2. Build a Cost Function
      The distance between the hypothesis and the true value
    3. Define your metrics (how to decide how good a prediction is)
    4. Reduce the Cost Function
      1. Iterate over the Training Set
      2. Update the hypothesis weights
    5. Validate your final hypothesis on the Testing Set
  • Example
    Change a linear regression to logistic by asking 'Would this igloo cost more than 200k?'
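A sketch of the igloo example as logistic regression (all numbers invented): the single feature is the igloo's area $r^2$ and the label is whether it costs more than 200k:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

area = np.array([[4.0], [9.0], [16.0], [25.0], [36.0], [49.0]])  # r^2
over_200k = np.array([0, 0, 0, 1, 1, 1])  # discrete outcome: yes/no

clf = LogisticRegression().fit(area, over_200k)

print(clf.predict([[10.0]]))        # small igloo -> 0
print(clf.predict([[40.0]]))        # big igloo   -> 1
print(clf.predict_proba([[40.0]]))  # class probabilities, not just a label
```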

Decision Tree

In a decision tree you lay out options for each feature and investigate the possible outcomes of choosing those options.
The idea of a decision tree is to divide the data set into smaller data sets based on the descriptive features until you reach a small enough set that contains data points that fall under one label/target. From this explanation
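A minimal sketch with scikit-learn; printing the tree shows the "questions" it learned to ask (the iris dataset is a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print(export_text(tree))  # each node asks a question about one feature
print(tree.score(X, y))   # training accuracy
```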


Continuous Features

In case you have many options for a feature (or it's a continuous feature such as time), you split the possible values into categories or boundaries and test the values against this quantization.

Advantages

  • Easy to implement (relatively)
  • Do not require big data for training
  • There is no need to scale or normalize the data.

Disadvantages

  • Likely to overfit noisy data (data with high variance, with fluctuations around values)

Pruning

A method to reduce the tree depth by detecting branches and paths that do not contribute to the outcome.
Pruning is especially useful to reduce overfitting (but not only).

Notes

  • A feature value can be asked more than once (but it's not very common)
  • The order in which you ask the questions may have impact on the prediction.
  • Gini Index:
    A subclass of decision trees are the binary decision trees, in which each node has only two exits.
    The impurity (or purity) measure used when building a decision tree in CART is the Gini Index. The decision tree built by the CART algorithm is always a binary decision tree (each node has only two child nodes).
    CART algorithm: Classification And Regression Tree

Random Forest

Random forest is an ensemble of several decision trees.

  • Bagging
    has a single parameter, which is the number of trees. All trees are fully grown binary trees (unpruned) and at each node in the tree one searches over all features to find the feature that best splits the data at that node.
  • Random Forests
    have 2 parameters
    • The first parameter is the same as bagging (the number of trees)
    • The second parameter (unique to random forests) is mtry, which is how many features to search over to find the best feature. This parameter is usually $\frac{1}{3}D$ for regression and $\sqrt{D}$ for classification. Thus during tree creation, mtry features are chosen at random from all available features and the best feature that splits the data is chosen.
  • Bootstrapping
    Multiple decision trees, each trained on a different sample of the data.
    Because bootstrapping involves sampling with replacement, some of the data in the sample is left out of each tree.
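A sketch tying the parameters above to scikit-learn's implementation: `n_estimators` is the number of trees and `max_features` plays the role of mtry ('sqrt' gives $\sqrt{D}$ for classification). The dataset is a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees
    max_features='sqrt',  # features searched at each split (mtry)
    random_state=0,
).fit(X, y)

print(forest.score(X, y))  # accuracy on the training data
```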

kNN

k Nearest Neighbors
The idea is to predict the label of a sample by looking at the nearest samples in the training data.
The distance is easy to imagine on two or three dimensions. In more dimensions you need a metric to define distance between samples. For example: Hamming distance, Large Margin Nearest Neighbor or Neighbourhood components analysis.
$k$ is the number of closest samples used to decide the value. If you have two classes, $k$ should be odd to avoid ties.
More generally, $k$ should not be a multiple of the number of classes.
Look at the impact of the value you choose for $k$:
What happens when k=3 and when k=5?
A drawback of the basic "majority voting" classification occurs when the class distribution is skewed.
Another drawback is that to find the closest samples you need to calculate the distances to lots of samples, which may be very time-consuming for large data sets.
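A minimal kNN sketch with scikit-learn (iris again as a stand-in); note that $k=5$ is odd and, with 3 classes, not a multiple of the number of classes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Each test sample is labeled by a vote of its 5 nearest training samples
print(knn.score(X_test, y_test))
```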

SVM

Support Vector Machine

SVM detects a road between the classes of the samples, then predicts by deciding on which side of the road the sample falls.
It doesn't just detect a road, but the widest possible road:

Hyperplane
Is the center of the road that separates the classes. The hyperplane always has one dimension less than the number of features, the dimension of the data.

Kernel
What happens when the data cannot be separated by a linear hyperplane?
The kernel is the function that determines the shape of the hyperplane.
In this example the best choice for the kernel would be RBF, the Radial Basis Function.
There are a few standard kernels, yet you can define any function to be your kernel. If you go deeper into kernels you'll see it is not necessary to evaluate the mapping explicitly, thus you can choose a very nasty kernel if it suits your data's shape.
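A sketch of why the kernel matters, using two concentric rings (a dataset no linear hyperplane can separate): the linear kernel stays near chance level while RBF separates the rings.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings of points: not linearly separable
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel='linear').fit(X, y)
rbf_svm = SVC(kernel='rbf').fit(X, y)

print(linear_svm.score(X, y))  # roughly chance level
print(rbf_svm.score(X, y))     # close to perfect
```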

Feature Space vs Input Space
These two become confusing when discussing SVMs. Even though they are similar, they are different. Same same but different...
Feature space refers to the n-dimensions where your variables live (not including a target variable, if it is present). The term is used often in ML literature because a task in ML is feature extraction, hence we view all variables as features. For example, consider the data set with:

  • Target
    • y≡ Thickness of car tires after some testing period
  • Variables
    • $x_1$≡ distance travelled in test
    • $x_2$≡ time duration of test
    • $x_3$≡ amount of chemical C in the tires

The feature space is $R^3$, or more accurately, in $R^{3+}$ (the positive quadrant) as all the $X$ variables can only be positive quantities. Domain knowledge about tires might suggest that the speed the vehicle was moving at is important, hence we generate another variable, $x_4$ (this is the feature extraction part):

$x_4≡\frac{x_1}{x_2}$ the speed of the vehicle during testing.

Mapping the input space into feature space.
The input is mapped into variables through the function $\phi$, from $R^3$ to $R^4$:

$\phi(x_1,x_2,x_3)=(x_1,x_2,x_3,\frac{x_1}{x_2})$


Using a kernel to map a non-linearly separable input into higher-dimensional, linearly separable features.

More
Here you can find an amazing lecture with the math behind SVMs.

Artificial Neural Networks

ANNs
Computing systems inspired by the biological neural networks of the brain. The performance progressively improves on tasks (i.e. Learns) by considering examples, generally without task-specific programming.
For example, in image recognition, the examples could be photos labeled as 'CAT' or 'NOT CAT'. After learning from many examples, the performance is measured by how many times the network correctly recognized whether there is a cat or not.
The learning process should be generic (regarding the content of the images): during the process, a set of relevant characteristics evolves from the learning material being processed.


In common ANN implementations, the signal at a connection between neurons is a real number, and the output of each neuron is calculated by a non-linear function of the sum of its inputs. Neurons and connections typically have a weight that adjusts as learning proceeds.

Neuron

An ANN is based on a collection of connected units or nodes called neurons. Each connection between neurons transmits a value from one to another. The neuron that receives the signal can process it and then signal neurons connected to it.

Weight

Is the importance / relevance each input to a neuron has. The weight values are changed on each learning iteration.

Activation

The outputs of one layer of neurons are connected to the inputs of the next layer.
What if we need the inputs of a neuron to comply with some limitations? To be only positive, for example.
In such cases we add an activation layer between layers: a layer whose only job is to transform the outputs of one layer into suitable inputs for the next layer.

ReLU

The most popular activation function today is the Rectified Linear Unit. What it does is to allow only positive values through:
$f(x) = x^+ = \max(0,x)$
$x$ is the input to the neuron.
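In NumPy this is a one-liner:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: pass positive values, zero out the rest."""
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))  # zeros for negatives
```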

Softmax

Softmax is another popular activation function. It maps any vector of real values into the closed segment [0,1], with the outputs summing to 1, so they can be read as probabilities:
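A small sketch of softmax in NumPy; subtracting the max before exponentiating is a standard trick for numerical stability:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by max for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p)        # each entry in [0, 1]; largest input -> largest probability
print(p.sum())  # 1.0
```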

Deep Nets

Deep network is an ensemble of many layers of neural networks.
Deep nets show outstanding results when trained. The catch is that they require huge data sets to reach their potential performance.

Backprop

Back Propagation
To improve the performance of a network, errors flow in the opposite direction to the data, correcting each of the network's weights.

Transfer Learning

Transfer learning or inductive transfer is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem.
For example, knowledge gained while learning to recognize cars could apply when trying to recognize trucks.

Some Neural Network Examples

Feedforward

Artificial neural networks where the connections between units do not form a cycle. Feedforward neural networks were the first type of artificial neural network invented and are simpler than their counterpart, recurrent neural networks. They are called feedforward because information only travels forward in the network (no loops), first through the input nodes, then through the hidden nodes (if present), and finally through the output nodes.
From this amazing website about mathematics.


Recurrent

The idea behind RNNs is to make use of sequential information. In a traditional neural network we assume that all inputs (and outputs) are independent of each other. But for many tasks that’s a very bad idea. If you want to predict the next word in a sentence you better know which words came before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output being dependent on the previous computations. Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far. In theory RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps. Here is what a typical RNN looks like:

A recurrent neural network and the unfolding in time of the computation involved in its forward computation.

  • $x_t$ is the input at time step t. For example, $x_1$ could be a one-hot vector corresponding to the second word of a sentence.
  • $s_t$ is the hidden state at time step t. It’s the “memory” of the network. $s_t$ is calculated based on the previous hidden state and the input at the current step: $s_t=f(U*x_t + W*s_{t-1})$. The function $f$ usually is a nonlinearity such as tanh or ReLU. $s_{-1}$, which is required to calculate the first hidden state, is typically initialized to all zeroes.
  • $o_t$ is the output at step $t$. For example, if we wanted to predict the next word in a sentence it would be a vector of probabilities across our vocabulary. $o_t = \mathrm{softmax}(V*s_t)$.

Taken from this great tutorial

Example
LSTM (Long Short-Term Memory, a gated variant of RNNs) used to play music. Duet between computer and human.

Autoencoder

A neural network that is trained to attempt to copy its input to its output. Internally, it has a hidden layer $h$ that describes a code used to represent the input. The network may be viewed as consisting of two parts: an encoder function $h=f(x)$ and a decoder that produces a reconstruction $r=g(h)$. If an autoencoder succeeds in simply learning to set $g(f(x))=x$ everywhere, then it is not especially useful. Instead, autoencoders are designed to be unable to learn to copy perfectly. Usually they are restricted in ways that allow them to copy only approximately, and to copy only input that resembles the training data. Because the model is forced to prioritize which aspects of the input should be copied, it often learns useful properties of the data.
From this online book


Convolutional

A neural network that uses filters to detect features in the input.
CNNs are quite complicated. A beautiful explanation can be found here.
Let's say we have a one-bit image (black and white), which is black where the value is 1 and white where it is 0. We define a filter (AKA kernel or feature detector). Then we scan the image with the filter and, at each position, count how many times the 1's in the filter are also 1 in the image.

The matrix formed by sliding the filter over the image and computing the dot product is called the Convolved Feature or Activation Map or the Feature Map. It is important to note that filters act as feature detectors from the original input image.
To see how different filters have different impact on images see this site.
Different filters detect different features: two different filters produce two different feature maps from the same image.
Now, once you have the features that were extracted from a labeled image, you can look for them in an unlabeled image and make decisions accordingly.
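The scan described above can be sketched directly in NumPy. The 5×5 image and 3×3 filter below are a classic toy example (not taken from this document's figures); sliding the filter and summing the element-wise products yields the feature map:

```python
import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

filt = np.array([[1, 0, 1],
                 [0, 1, 0],
                 [1, 0, 1]])

h = image.shape[0] - filt.shape[0] + 1  # output height
w = image.shape[1] - filt.shape[1] + 1  # output width
feature_map = np.zeros((h, w), dtype=int)

# Slide the filter over the image; each cell counts the overlapping 1's
for i in range(h):
    for j in range(w):
        feature_map[i, j] = (image[i:i + 3, j:j + 3] * filt).sum()

print(feature_map)
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```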

Example
A fun game demonstrating CNN use.

Unsupervised Learning

Unsupervised machine learning is the machine learning task of inferring a function to describe hidden structure from "unlabeled" data.
Since the examples given to the learner are unlabeled, there is no evaluation of the accuracy of the structure that is output by the relevant algorithm—which is one way of distinguishing unsupervised learning from supervised learning and reinforcement learning.

Applications

  • Behavioural Analysis
    Predict individual actions based on collective behaviour.
    Detect that a driver is very likely to have an accident based on others in his area...
  • Anomaly Detection
    Detect that our driver is very likely to have an accident based on his own history till now...
    Detect entry or exit points of contracts or stock holdings.
    Fraud detection.
  • Pattern Detection
    Learn DNA mapping by inputting DNA pictures, parameters and observations.
  • Density Estimation
    Building an estimate for an additional parameter based on the data we have.

Methods

  • Dimension Reduction
    When we need to detect which are the most useful inputs or features in our set.
  • Cluster Analysis
    Look at the data values and detect grouping similarities, with algorithms such as k-means
  • Neural Networks
    • SOMs
      Self Organizing Maps produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, and is therefore a method to do dimensionality reduction. Self-organizing maps differ from other neural networks as they apply competitive learning as opposed to error-correction learning in supervised learning.
    • ART
The primary intuition behind the Adaptive Resonance Theory model is that object identification and recognition generally occur as a result of the interaction of 'top-down' observer expectations with 'bottom-up' sensory information. The model postulates that 'top-down' expectations take the form of a memory template or prototype that is then compared with the actual features of an object as detected by the senses. This comparison gives rise to a measure of category belongingness. As long as this difference between sensation and expectation does not exceed a set threshold called the 'vigilance parameter', the sensed object will be considered a member of the expected class.

Reinforcement Learning

RL is a learning approach based on taking actions in an environment so as to maximize some notion of cumulative reward.

Applications

The problem, due to its generality, is studied in many other disciplines

  • Game theory
  • Control theory
  • Operations research
  • Information theory
  • Simulation-based optimization
  • Multi-agent systems
  • Swarm Intelligence
  • Genetic algorithms

Example

This is a collective problem the players need to solve together.

  • The players cannot talk to each other.
  • Each player has two buttons. Red and Blue.
  • They get points only when they tap on a button and nobody else taps a button of the same color.
  • They can get between 0.0 and 10.0 points
  • If they tap one after the other quickly, the reward grows on each tap.
  • They get more points when the successful taps are evenly distributed among all players (over 4 consecutive successful taps, the maximum is obtained if each player tapped once)
  • If two players tap together, they get nothing and the next time somebody taps the reward restarts from 0.1.

In the image above, thanks to P1 they got additional 1.1 points and thanks to P2 they got 3.7 points.

Parametrization

  • At each step t each agent (player)
    • Executes action $a_t$ (or not, if doesn't push any button)
    • Receives observation $o_t$ which could be how many points they got in step t
    • Receives a reward $rb_t$ if pressed the blue button and $rr_t$ if pressed the red button
  • The Environment
    • Receives action $a_t$
    • Emits observation $o_{t+1}$
    • Emits the rewards $rb_{t+1}$ and $rr_{t+1}$ to the players that took action

Real World Example
The ALOHA network problem

  • Pairs of users want to use the same N network channels to transmit packets, without the ability to communicate with other user pairs.
  • At each transmission quantum, each user decides whether to transmit or not.
  • If a user succeeds in transmitting a packet, it gets an ACK from its partner.
  • If more than one user transmits over the same channel at the same time, we assume no message was transmitted.
    The Goal: Find a distributed method for maximizing network utilization

When the guys at Rephael used RL for solving the quest, they tried three approaches:

  • General Rewarding: All agents get rewards for successful transmissions.
  • Private Rewarding: Only the agent that got the message through gets rewarded.
  • Fair General Rewarding: All agents get rewarded proportionally to the general number of transmitted packages (like in our game).

The Results

For General Reward, since everybody got the same points, the optimal procedure was that two couples transmit all the time and all the rest stay silent. (Communism doesn't work in channel optimization.)

When agents were privately rewarded, all participants had to find how to divide the time slots so everybody can transmit. (Homo homini lupus, i.e. Adam le Adam Zeev.)

When the reward was proportional to the messages all participants sent, the strategy found was to hold the channel until somebody else attempts to transmit (a packet is dropped, for those who are still awake) and then release the channel for the other agent.

The Optimal channel utilization was achieved when each agent was rewarded privately.

For further reading, here is a research paper regarding this example and a lecture based on this research by the same guy. I built this section based on the latter.

Wrap Up

ML Areas


ML algorithms cheat sheet


Matching between meta data and model
